PPOL 6801: Text as Data: Computational Linguistics
2025-04-03
Motivating idea:
Embeddings offer a compelling empirical technique to capture word context. Measures of context can help answer questions like:
How has the meaning of a word changed through time?
How do different groups use the same word differently?
Motivating issues:
Training embeddings is computationally expensive
Words/phrases of interest to researchers are often highly specific and may have too little data on which to train embeddings
Current embedding techniques generally do not provide an inference mechanism (i.e., hypothesis testing)
Proposed framework (illustrated below via three use cases):
Choose focal word(s) of interest from corpus
Choose covariates of interest
Use à la carte embeddings to estimate context-specific embeddings for focal word(s). This requires:
Choosing a context window (e.g., 6)
Pretrained embeddings (e.g., GloVe)
Transformation matrix (\(\hat{A}\))
Regress context-specific embeddings on covariate(s) of interest
Take the norm of each row of the returned \(\hat{\beta}\) matrix corresponding to the covariate(s) of interest
Utilize bootstrapping and permutation testing to calculate p-values for inference
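The steps above are implemented on the R side by the conText package; as a minimal, illustrative sketch of the same logic in Python/NumPy (random toy data stands in for the GloVe vectors, the transform matrix \(\hat{A}\), and the corpus; nothing here is the paper's actual data or code), the ALC-embedding, regression, and permutation-test steps look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (all illustrative, not the paper's data):
D = 25                                # embedding dimension
V = 100                               # vocabulary size
pretrained = rng.normal(size=(V, D))  # stands in for pretrained GloVe vectors
A_hat = np.eye(D)                     # stands in for the learned ALC transform matrix

def alc_embedding(context_ids):
    """A la carte: transform the average of the context words' pretrained vectors."""
    return A_hat @ pretrained[context_ids].mean(axis=0)

# One context-specific embedding per instance of the focal word, plus a binary covariate
n = 200
X = np.column_stack([np.ones(n), rng.integers(0, 2, size=n)])  # intercept + group dummy
Y = np.stack([alc_embedding(rng.integers(0, V, size=12)) for _ in range(n)])
Y[X[:, 1] == 1] += 0.5                # build in a group difference so there is something to find

# Embedding regression: OLS of each embedding dimension on the covariates
beta = np.linalg.lstsq(X, Y, rcond=None)[0]   # shape (2, D): rows = intercept, group
stat = np.linalg.norm(beta[1])                # norm of the group coefficient vector

# Permutation test: shuffle the covariate to build a null distribution of the norm
null = []
for _ in range(500):
    Xp = np.column_stack([X[:, 0], rng.permutation(X[:, 1])])
    null.append(np.linalg.norm(np.linalg.lstsq(Xp, Y, rcond=None)[0][1]))
p_value = (1 + sum(s >= stat for s in null)) / (1 + len(null))
```

The permutation p-value asks how often a shuffled covariate produces a normed coefficient as large as the observed one; bootstrapping (resampling instances with replacement) would analogously give a standard error for the norm.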
Use case 1: Was the word “Trump” used differently post-2016 U.S. election relative to the word “Clinton”?
Use case 2: Do U.S. Democrats and Republicans attach different meanings to the same words?
Regression: \(\text{Focal word embedding} = \beta_0 + \beta_1 \text{Republican} + \beta_2 \text{Male} + \epsilon\)
Use case 3: Did the word “empire” take on different meanings in the United States vs. the United Kingdom after World War II?
Regression: \(\text{Empire embedding} = \beta_0 + \beta_1 \text{CongressionalRecords} + \epsilon\)
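As a toy numerical illustration of the Democrats vs. Republicans specification (use case 2), with fully synthetic data (only the covariate names come from the regression above; the effect sizes are invented), each covariate's coefficient vector is normed separately:

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 300, 20
republican = rng.integers(0, 2, size=n)          # party dummy
male = rng.integers(0, 2, size=n)                # gender dummy
X = np.column_stack([np.ones(n), republican, male])

# Synthetic embeddings: party shifts the focal word's meaning, gender does not
Y = rng.normal(size=(n, D)) + 0.6 * republican[:, None]

beta = np.linalg.lstsq(X, Y, rcond=None)[0]      # rows: intercept, Republican, Male
norm_rep = np.linalg.norm(beta[1])               # one normed coefficient per covariate
norm_male = np.linalg.norm(beta[2])
```

Here the Republican norm should dominate the Male norm, mirroring how the framework summarizes which covariates are associated with shifts in a word's usage.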
Findings: all results we re-ran replicated (we ran only a subset of the analyses)
Code:
Outdated functions:
geom_vline() and geom_hline(): used the size argument where linewidth should have been used (size is deprecated for lines as of ggplot2 3.4.0)
conText() transform_matrix: jackknife and bootstrap were both specified when only one is possible at a time
Code organization/repetition:
The authors separated their code into shorter R scripts
Manageable chunks made interpretation easier
However, resulted in lots of repetition across scripts
When combining scripts, we edited for conciseness and eliminated repetition
Inconsistent naming:
Naming was sometimes inconsistent across authors’ scripts
The same name was used for different dataframes
Large file sizes resulted in longer run times
Validation via convergent construct (use case 2)
Investigated the validation methods discussed in Quinn et al. (2010)
Determined that “convergent construct” validation would make the most sense
The conText() regression has a variety of parameters; we were interested in how sensitive the results were to different researcher choices
In the Trump vs. Clinton example, we re-ran with hard_cut = TRUE, requiring that every instance in the analysis have a full context window of 6 tokens on each side (e.g., this excludes instances where the target word is the first word of a document)
This choice significantly cut down the sample size, but the findings were robust
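Window construction is handled on the R side by conText; the following Python sketch (purely illustrative, with a made-up contexts() helper) shows what the hard cut implies: occurrences of the target word without a full window on both sides are dropped, which is why the sample shrinks.

```python
def contexts(tokens, target, window=6, hard_cut=True):
    """Collect context windows around each occurrence of `target`.

    With hard_cut=True, occurrences lacking a full `window` tokens on BOTH
    sides are dropped entirely (shrinking the sample); with hard_cut=False,
    a truncated window is kept instead.
    """
    out = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        if hard_cut and (len(left) < window or len(right) < window):
            continue  # e.g., target too close to the start of the document
        out.append(left + right)
    return out

toks = ["media", "trump", "covered", "the", "election", "closely"]
print(contexts(toks, "trump", window=3, hard_cut=True))   # -> []
print(contexts(toks, "trump", window=3, hard_cut=False))  # -> [['media', 'covered', 'the', 'election']]
```

In the toy example, “trump” has only one token to its left, so the hard cut drops the instance entirely while the soft cut keeps a truncated window.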
Future work:
Utilize the other validation methods listed in Quinn et al. (2010) (semantic, predictive)
Further examine sensitivity of results to researcher decisions (context window, preprocessing, etc.)
Examine U.S. vs. other countries’ usage of a word that is more currently relevant than “empire”
Quinn, Kevin M., Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54 (1): 209–28.
Rodriguez, Pedro L., Arthur Spirling, and Brandon M. Stewart. 2023. “Embedding Regression: Models for Context-Specific Description and Inference.” American Political Science Review 117 (4): 1255–74. doi: 10.1017/S0003055422001228.
Rodriguez, Pedro L.; Spirling, Arthur; Stewart, Brandon M., 2023, “Replication Data for: Embedding Regression: Models for Context-Specific Description and Inference”, https://doi.org/10.7910/DVN/NKETXF, Harvard Dataverse, V1, UNF:6:gBkWkhpPxkGmXEddHggmJQ== [fileUNF]